Skip to content

Conversation

@Luap99
Copy link
Member

@Luap99 Luap99 commented Nov 25, 2025

see individual commits

Does this PR introduce a user-facing change?

Fixed a possible nil deref during podman system migrate.
Improved the reliability  of the podman rootless userns creating when conmon and the pause process where killed unexpectedly.

@openshift-ci openshift-ci bot added release-note approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 25, 2025
@Luap99
Copy link
Member Author

Luap99 commented Nov 25, 2025

@mheon @giuseppe PTAL

if became {
os.Exit(ret)
}
// FIXME: should this be an error? If we never re-exec that seems like a big issue here as we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we at least drop a logrus.Errorf here? See if people actually report seeing this

Copy link
Member Author

@Luap99 Luap99 Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this on of the return to satisfy the compiler and not something that should ever be reached by my reading of the code so I think even a hard error should be safe, I just added the fixme to get your and others attention to it in case I am was missing something.

I am fine with both a logrus.Error or even hard error. I am somewhat confident in the 550-pause-process.bats test cases that should check all the cases. The one thing the tests cannot do of course is check what happens if we reach here in parallel.

I wonder if taking the alive lock here would be a good idea.

@mheon
Copy link
Member

mheon commented Nov 25, 2025

Overall LGTM minus one comment, but we really need a Giuseppe review to merge, my grasp on this code isn't as good as I'd like

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

There is no good reason to use logrus and os.Exit() here, other parts of
this function already return the error so do the same. The main podman
process will exit then with the normal formatted error message.

And also log an error about the last return which should never happen as
we should have exited above if the re-exec worked or errored out.

Signed-off-by: Paul Holzinger <[email protected]>
Based on the description in commit 63ef557 this was added so that the
migrate command does not move the pause process into a separate cgroup.

It should however not disable the rejoining of the userns when the pause
process join failed. BEcause of this we end up calling migrate without a
userns and that then can fail if there are actual contianer it tries to
cleanup.

Fixes: 63ef557 ("command: migrate doesn't move process to cgroup")

Signed-off-by: Paul Holzinger <[email protected]>
Just a minor improvement as we know the size needed for the slice we can
allocate it only once instead of the append having to resize it.

Signed-off-by: Paul Holzinger <[email protected]>
When trying to join the conmon pid to recreate the pause process based
on the namespace it can be that the pid is no longer valid, i.e. when
conmon crashed or was killed.

Currently we have a big issue that can be reproduced using:
$ podman run -d quay.io/libpod/testimage:20241011 sleep 100
$ killall -9 conmon
$ killall catatonit

All commands would fail as we keep trying to rejoin the namespace of the
non existing conmon process.

So to address that fall back to creating a new namespace if we fail to
join the conmon pids.

Signed-off-by: Paul Holzinger <[email protected]>
There doesn't seem any reason why the system commands should not join
the userns. In particular the main commands use ParentNSRequired and
UnshareNSRequired when they don't want to be joined to the main userns.
Since the system command don't set these the go code does the join and
re-exec anyway so might as well use the shortcut to speed that up.

Signed-off-by: Paul Holzinger <[email protected]>
@mheon
Copy link
Member

mheon commented Nov 25, 2025

Sure
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 25, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 4ca9158 into containers:main Nov 25, 2025
80 checks passed
@Luap99 Luap99 deleted the migrate branch November 25, 2025 19:39
@Luap99
Copy link
Member Author

Luap99 commented Nov 26, 2025

/cherry-pick v5.7

@openshift-cherrypick-robot
Copy link
Collaborator

@Luap99: new pull request created: #27611

In response to this:

/cherry-pick v5.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release-note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants